The code for this analysis is published in a public GitHub repository.

I. Introduction

i) Friends, an iconic sitcom

Friends is an American situation comedy created by David Crane and Marta Kauffman that aired on NBC from September 22, 1994, to May 6, 2004, spanning ten seasons. The show featured an ensemble cast starring Jennifer Aniston (Rachel), Courteney Cox (Monica), Lisa Kudrow (Phoebe), Matt LeBlanc (Joey), Matthew Perry (Chandler) and David Schwimmer (Ross).

The show revolved around six friends in their 20s and 30s who lived in Manhattan, New York City. Rachel Green, a sheltered but friendly woman, flees her wedding day and her rich yet unfulfilling life, and finds childhood friend Monica Geller, a tightly-wound but caring chef. After Rachel becomes a waitress at coffee house Central Perk, she and Monica become roommates at Monica’s apartment located directly above Central Perk, and Rachel joins Monica’s group of single people in their mid-20s: her previous roommate Phoebe Buffay, an eccentric, innocent masseuse; her neighbor across the hall Joey Tribbiani, a dim-witted yet loyal struggling actor and womanizer; Joey’s roommate Chandler Bing, a sarcastic, self-deprecating IT manager; and her older brother and Chandler’s college roommate Ross Geller, a sweet-natured but insecure paleontologist.

Friends received positive reviews throughout its run and became one of the most popular sitcoms of its time. The series won many awards and was nominated for 63 Primetime Emmy Awards. The series was also very successful in the ratings, consistently ranking in the top ten in the final primetime ratings. Friends has made a large cultural impact and has become a model for later sitcoms.

ii) Motivation & Questions

As teenagers at the beginning of the century, we were heavily influenced by the Friends phenomenon and became huge fans of the sitcom. We decided to work on this project to challenge our preconceptions of the show through data analysis and to discover hidden insights. The questions that guide our quantitative assessment are the following:

  • Can we categorize all the appearing characters of the sitcom by importance? At first glance this question could seem simple, but under the assumption that we do not possess any previous knowledge of the sitcom, and considering that more than 800 characters appeared over the ten seasons, the analysis represents a challenge.

  • Can we identify and quantify the interactions between the main and secondary characters? What would be an appropriate way to quantify and visualize these relationships?

  • Which are the most recurrent topics throughout the seasons and episodes of the show? How did the themes of the show evolve over its ten seasons? Can we extract this information from the dialogues of the show?

  • Can we determine the contribution of each character to the popularity of the sitcom? Does the participation of each character influence viewers' preferences?

iii) R Libraries, Machine Learning techniques & Other resources

We have used the following R libraries for the development of this project:

II. Data sources

i) Primary sources

The primary data sources that we used for our project, and that we consider to be of adequate quality, are:

  • Transcripts: For the transcripts, we used an open resource built by fans of the sitcom that has been compiled in a GitHub repository. The repository contains all the dialogues of the characters for the 231 episodes of the TV show. The data is organized in HTML documents. The data can be accessed via: https://fangj.github.io/friends/. If you want to see how the transcripts are originally presented, please click here.

  • Ratings: For the ratings, we used the IMDb Datasets, which are available for personal and non-commercial use. The data is structured in seven compressed TSV files that contain general information about the show (genre, start year, end year, episode duration, etc.) and specific information about each episode (title, rating, characters, crew, etc.). A relevant characteristic of the database is that it is refreshed daily. We consulted the data on November 10, 2019. The data can be accessed via: https://datasets.imdbws.com/

ii) Data quality and challenges

  • IMDb Dataset:

    • The first obstacle that we faced with the IMDb datasets was their size: some of them have millions of rows with information on TV series, shorts, movies, documentaries, and other entertainment formats. It was impossible to store them in our GitHub repository.

    • The second obstacle was to track down the data corresponding to our case study. For example, when we searched the dataset by the name 'Friends' alone, we found 178 results for TV series or movies called 'Friends'. It was necessary to research the start and end years of the series to refine the search.

    • Another obstacle was that the ID for TV series across the seven IMDb datasets was not uniform. For example, in the dataset corresponding to the titles of the TV series, the ID that identifies the show is named "tconst", while in the dataset from which we get the ID of each episode, "tconst" corresponds to the ID of the episode and the ID for the TV series is called "parentTconst". These inconsistencies were identified through exploration of the datasets.

  • Transcript Dataset:

    • The main obstacle of the dialogue dataset is that not all the HTML files share the same format. In our first attempt, we were not able to extract the dialogues of 15 episodes. We overcame this difficulty by incorporating into our scraping code the special cases that we had detected.

    • The second difficulty that we experienced with the dialogue dataset was the cleaning of the dialogues themselves. We tried to standardize the content of the dialogues as much as possible by identifying different names for the same character, common typos, and regular expressions that could hinder our analysis.

    • You can follow the scraping code that led to the following data frame by looking into the "frieds.Rmd" file in the GitHub repository.

    library(robotstxt)                        # paths_allowed() comes from robotstxt
    url <- "https://fangj.github.io/friends/"
    paths_allowed(url)                        # check that robots.txt allows scraping
    ## # A tibble: 6 x 5
    ##   episode_id line_num scene character line                                 
    ##   <chr>         <dbl> <dbl> <chr>     <chr>                                
    ## 1 1 : 01            1     1 MONICA    There's nothing to tell! He's just s…
    ## 2 1 : 01            2     1 JOEY      C'mon, you're going out with the guy…
    ## 3 1 : 01            3     1 CHANDLER  All right Joey, be nice. So does he …
    ## 4 1 : 01            4     1 PHOEBE    Wait, does he eat chalk?             
    ## 5 1 : 01            5     1 PHOEBE    Just, 'cause, I don't want her to go…
    ## 6 1 : 01            6     1 MONICA    Okay, everybody relax. This is not e…
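
  The scraping step itself is hidden in this document. Below is a minimal sketch of the idea, assuming each transcript page marks dialogue as "CHARACTER: line" inside <p> elements (the actual code in the repository handles several page formats, as discussed above):

    library(rvest)
    library(stringr)

    scrape_episode <- function(episode_url) {
      page  <- read_html(episode_url)
      paras <- html_text2(html_elements(page, "p"))
      # keep only "CHARACTER: dialogue" paragraphs and split them in two
      paras <- paras[str_detect(paras, "^[A-Za-z. ]+:")]
      data.frame(
        character = toupper(str_trim(str_extract(paras, "^[^:]+"))),
        line      = str_trim(str_remove(paras, "^[^:]+:"))
      )
    }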

    III. Data transformation

    The objective of the data transformation stage was to extract the information from our two data sources, IMDb and the GitHub repository, and build a unique Metabase that constituted our main data frame for carrying out our quantitative analysis of Friends and generating suitable graphics. To achieve our objective we followed the next steps:


    i) Building a Metabase

    • Step 1:
      • Step 1.1: After scraping the dialogue data, we did some cleaning and transformation to create a data frame with the character name, dialogue line, scene, episode and word count.
      First, we added a word count to the dialogues, as sketched after the table below.
    ## # A tibble: 6 x 2
    ##   line                                                                words
    ##   <chr>                                                               <int>
    ## 1 There's nothing to tell! He's just some guy I work with!               11
    ## 2 C'mon, you're going out with the guy! There's gotta be something w…    14
    ## 3 All right Joey, be nice. So does he have a hump? A hump and a hair…    16
    ## 4 Wait, does he eat chalk?                                                5
    ## 5 Just, 'cause, I don't want her to go through what I went through w…    16
    ## 6 Okay, everybody relax. This is not even a date. It's just two peop…    21
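
    A minimal sketch of this step, assuming the scraped data frame is called dialogues (the name is illustrative):

    library(dplyr)
    library(stringr)

    # words = number of whitespace-separated tokens in each line
    dialogues <- dialogues %>%
      mutate(words = str_count(line, "\\S+"))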

    We can see that some episodes were put together in the same file:

    ## [1] "2 : 12-13"  "6 : 15-16"  "9 : 23-24"  "10 : 17-18"

    Now we split those episodes into two different ones:

    For more clarity, we will add season and episode columns.
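
    A sketch of one way to derive these columns from episode_id, whose format is "season : episode":

    library(dplyr)
    library(tidyr)

    dialogues <- dialogues %>%
      separate(episode_id, into = c("season", "episode"),
               sep = " : ", remove = FALSE, convert = TRUE)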

    Now we have to correct some character names that had typos, and we remove some lines caught by the scraping code that are not dialogues.

    ## # A tibble: 6 x 8
    ##   episode_id line_num scene character line             words season episode
    ##   <chr>         <dbl> <dbl> <chr>     <chr>            <int>  <int>   <int>
    ## 1 1 : 01            1     1 MONICA    There's nothing…    11      1       1
    ## 2 1 : 01            2     1 JOEY      C'mon, you're g…    14      1       1
    ## 3 1 : 01            3     1 CHANDLER  All right Joey,…    16      1       1
    ## 4 1 : 01            4     1 PHOEBE    Wait, does he e…     5      1       1
    ## 5 1 : 01            5     1 PHOEBE    Just, 'cause, I…    16      1       1
    ## 6 1 : 01            6     1 MONICA    Okay, everybody…    21      1       1

    More data transformation will be used and explained in each of the Results subsections.

      • Step 1.2: Extract, decompress and save the tables as data frames. The 3 IMDb tables that we used in our analysis are those that allowed us to extract the information related to the rating of each episode. A sketch of the loading step follows.
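
    As a sketch, each compressed table can be read directly from the IMDb URL. The factor columns in the output below suggest the project loaded the tables with base R's read.delim; a readr equivalent is shown here (read_tsv decompresses .gz files transparently):

    library(readr)

    # "\\N" is IMDb's marker for missing values
    ratings <- read_tsv("https://datasets.imdbws.com/title.ratings.tsv.gz",
                        na = "\\N")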
      

    title.ratings.tsv.gz

    ## 'data.frame':    990485 obs. of  3 variables:
    ##  $ tconst       : Factor w/ 990485 levels "tt0000001","tt0000002",..: 1 2 3 4 5 6 7 8 9 10 ...
    ##  $ averageRating: num  5.6 6.1 6.5 6.2 6.1 5.2 5.5 5.4 5.4 6.9 ...
    ##  $ numVotes     : int  1547 187 1204 114 1932 102 615 1663 81 5539 ...

    title.basics.tsv.gz

    ## 'data.frame':    6302902 obs. of  9 variables:
    ##  $ tconst        : Factor w/ 6302902 levels "tt0000001","tt0000002",..: 1 2 3 4 5 6 7 8 9 10 ...
    ##  $ titleType     : Factor w/ 10 levels "movie","short",..: 2 2 2 2 2 2 2 2 1 2 ...
    ##  $ primaryTitle  : Factor w/ 3167092 levels "''The Shot''",..: 487397 1578497 2044003 2943478 392618 532829 596087 814026 1795571 955903 ...
    ##  $ originalTitle : Factor w/ 3182719 levels "''The Shot''",..: 487809 1594160 2064684 2956049 393113 533655 596985 817458 1814307 1564442 ...
    ##  $ isAdult       : Factor w/ 29 levels "\\N","0","1",..: 2 2 2 2 2 2 2 2 2 2 ...
    ##  $ startYear     : Factor w/ 148 levels "\\N","1874","1878",..: 14 12 12 12 13 14 14 14 14 15 ...
    ##  $ endYear       : Factor w/ 113 levels "\\N","11","12",..: 1 1 1 1 1 1 1 1 1 1 ...
    ##  $ runtimeMinutes: Factor w/ 863 levels "\\N","0","1",..: 3 559 447 1 3 3 3 3 505 3 ...
    ##  $ genres        : Factor w/ 2236 levels "","\\N","Action",..: 1519 836 696 836 1215 2211 2212 1519 2178 1519 ...

    title.episode.tsv.gz

    ## 'data.frame':    4425501 obs. of  4 variables:
    ##  $ tconst       : Factor w/ 4425501 levels "tt0041951","tt0042816",..: 1 2 3 4 5 6 7 8 9 10 ...
    ##  $ parentTconst : Factor w/ 128694 levels "tt0038276","tt0039122",..: 59 34810 34810 22 34810 34810 34810 34274 34810 34810 ...
    ##  $ seasonNumber : Factor w/ 244 levels "\\N","1","10",..: 2 2 1 131 103 103 131 2 131 179 ...
    ##  $ episodeNumber: Factor w/ 15556 levels "\\N","0","1",..: 14445 6320 1 9103 6209 13326 7769 11103 9547 1115 ...
    • Step 2: Join the 3 IMDb data frames to create an intermediate data frame. Notice that this data frame contains the average IMDb rating per episode. Furthermore, we created a suitable key (episode_id) to join this data frame with the data frame that contains the dialogues.
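
    A minimal sketch of this step, assuming the three tables were loaded as basics, episodes and ratings (illustrative names), and that the show is disambiguated by title type and start year:

    library(dplyr)

    # ID of the Friends TV series among the 178 titles named "Friends"
    friends_id <- basics %>%
      filter(primaryTitle == "Friends", titleType == "tvSeries",
             startYear == "1994") %>%
      pull(tconst)

    imdb <- episodes %>%
      filter(parentTconst == friends_id) %>%
      inner_join(ratings, by = "tconst") %>%
      mutate(episode_id = paste(seasonNumber,
                                sprintf("%02d", as.integer(as.character(episodeNumber))),
                                sep = " : "))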
    ## 'data.frame':    236 obs. of  9 variables:
    ##  $ parentTconst : chr  "tt0108778" "tt0108778" "tt0108778" "tt0108778" ...
    ##  $ titleType    : Factor w/ 1 level "tvSeries": 1 1 1 1 1 1 1 1 1 1 ...
    ##  $ primaryTitle : Factor w/ 1 level "Friends": 1 1 1 1 1 1 1 1 1 1 ...
    ##  $ tconst       : chr  "tt0583431" "tt0583432" "tt0583433" "tt0583434" ...
    ##  $ seasonNumber : Factor w/ 244 levels "\\N","1","10",..: 212 3 3 3 223 3 190 201 103 190 ...
    ##  $ episodeNumber: Factor w/ 15556 levels "\\N","0","1",..: 13326 14445 6320 6431 3 3 3 3 2226 7769 ...
    ##  $ averageRating: num  8.2 8.6 9.5 9.7 8.7 8.5 8.9 8.7 8.6 8.8 ...
    ##  $ numVotes     : int  2568 2641 5829 9699 2783 2889 3376 2962 3472 3100 ...
    ##  $ episode_id   : chr  "7 : 08" "10 : 09" "10 : 17" "10 : 18" ...
    • Step 3: With a left join we created the Metabase, which has the dialogues as the primary data frame and the IMDb ratings as the secondary data frame.
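
    A sketch of the left join, using the data frames from the previous steps:

    library(dplyr)

    # every dialogue line keeps its episode's rating and vote count
    dialogues <- dialogues %>%
      left_join(imdb, by = "episode_id")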
    
    ## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 61264 obs. of  12 variables:
    ##  $ episode_id   : chr  "1 : 01" "1 : 01" "1 : 01" "1 : 01" ...
    ##  $ line_num     : num  1 2 3 4 5 6 7 8 9 10 ...
    ##  $ scene        : num  1 1 1 1 1 1 1 2 2 2 ...
    ##  $ character    : chr  "MONICA" "JOEY" "CHANDLER" "PHOEBE" ...
    ##  $ line         : chr  "There's nothing to tell! He's just some guy I work with!" "C'mon, you're going out with the guy! There's gotta be something wrong with him!" "All right Joey, be nice. So does he have a hump? A hump and a hairpiece?" "Wait, does he eat chalk?" ...
    ##  $ words        : int  11 14 16 5 16 21 6 22 5 11 ...
    ##  $ season       : int  1 1 1 1 1 1 1 1 1 1 ...
    ##  $ episode      : int  1 1 1 1 1 1 1 1 1 1 ...
    ##  $ parentTconst : chr  "tt0108778" "tt0108778" "tt0108778" "tt0108778" ...
    ##  $ tconst       : chr  "tt0583459" "tt0583459" "tt0583459" "tt0583459" ...
    ##  $ averageRating: num  8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 ...
    ##  $ numVotes     : int  6098 6098 6098 6098 6098 6098 6098 6098 6098 6098 ...

    ii) Data structure for Network analysis

    For the Network analysis, a special data structure is required. We define an interaction between characters as sharing the same scene. We must mention that the original structure of the dialogue data does not allow us to identify the exact interactions of the characters within a scene. Hence, we have assumed that all the characters that appear in a scene interact with each other. Moreover, we represent the interactions between the characters as an adjacency matrix, where we can observe the number of interactions that each character has with every other character.
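
    A minimal sketch of building this matrix, assuming a scene is identified by the (episode_id, scene) pair in the Metabase:

    library(dplyr)

    # one row per character per scene
    appearances <- dialogues %>%
      distinct(episode_id, scene, character)

    # character-by-scene incidence matrix
    incidence <- as.matrix(table(appearances$character,
                                 paste(appearances$episode_id, appearances$scene)))

    # adjacency matrix: scenes shared by each pair of characters
    adjacency <- tcrossprod(incidence)
    diag(adjacency) <- 0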

    IV. Missing values

    To search for missing values we look at the number of missing values per column in dialogues.
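
    This check is a one-liner (dialogues is the Metabase built above):

    colSums(is.na(dialogues))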

    ##    episode_id      line_num         scene     character          line 
    ##             0             0             0             0            61 
    ##         words        season       episode  parentTconst        tconst 
    ##             0             0             0             0             0 
    ## averageRating      numVotes 
    ##             0             0

    There appear to be some missing values in the line column. We will use visna from the extracat library to see the pattern of missing values.

    These missing values are due to different formats in the GitHub pages used for web scraping. For example, lines like “Paolo: (something in Italian)” produce an NA line because the scraping code removes everything between parentheses.

    We decided to fill those missing values with “”. By doing so we keep the record of those characters’ dialogue.

    V. Results

    i) Labeling the main and secondary characters and their participation

    • Categorizing characters by participation

    The first big question that we want to explore is the categorization of the characters by the importance of their participation. As we have previously mentioned, this question may seem naive; however, if we assume that we have no previous knowledge of the sitcom, and considering that more than 800 characters appeared over the ten seasons, the analysis is far from a naive exercise.

    To answer this question we used the unsupervised machine learning technique k-means. Its objective is to label the data based on certain characteristics; in this case, we used the number of words, lines and scenes of each character. To accomplish this task we used the cluster and base libraries. Moreover, we established a priori the desired number of labels for our data; for practical purposes, we set the number of groups to three. A sketch of this step is shown below.
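    A sketch of the clustering step, assuming a per-character summary table called totals with the columns shown in the centers output below (the seed is an assumption for reproducibility):

    library(dplyr)

    set.seed(42)                 # k-means is sensitive to random initialization
    km <- kmeans(totals[, c("Total_scene", "Total_lines", "Total_words")],
                 centers = 3)

    totals <- totals %>% mutate(vcluster = km$cluster)
    km$centers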

    From the k-means analysis we obtain the following separation of characters:

    • Main Characters: As expected, Rachel, Monica, Phoebe, Joey, Chandler and Ross constitute one group, which on average has 1,680 scenes, 8,469 lines and 87,498 words per character.

    • Secondary Characters: This group is composed of 33 characters, most of them recurrent characters and guest stars. The average character in this group has 35 scenes, 133 lines and 1,228 words.

    • Other Characters: Composed of those characters that are incidental or did not have a relevant role in the sitcom. The average character in this group has 2 scenes, 7 lines and 64 words.

    Centers:

    ##   Total_scene Total_lines Total_words vcluster
    ## 1 1679.833333 8469.333333 87498.66667        1
    ## 2   35.484848  133.636364  1228.57576        3
    ## 3    2.480723    7.274699    64.11446        2

    Interactivity Scatter Plot

    Shiny applications not supported in static R Markdown documents

    • Main characters participation

    Friends is a TV show that tells the story of a group of six friends: Monica, Rachel, Phoebe, Chandler, Ross and Joey. Is one of these characters more important than the others? We try to answer this question by looking at the number of lines for each of the main characters.

    We can see that Rachel is the character with the most lines and Phoebe is the character with the fewest lines. Now we focus on the number of words instead of the number of lines.

    Rachel and Ross are again the characters that speak the most, and Phoebe is the one with the fewest words. We can see that Monica was number 3 for number of lines but she is number 5 for number of words. This suggests that Monica's lines tend to be shorter. The opposite happens with Joey: he is number 5 for number of lines, but he is third for number of words. This suggests his lines tend to be longer.

    By looking into the lines-per-episode distribution we find the following:

    • Monica's distribution looks narrower than the others. This indicates that there are few episodes in which Monica speaks a lot.

    • Chandler and Ross have long right tails; we infer that those characters have episodes in which they speak a lot.

    • Rachel and Ross have wider distributions.

    ii) Unraveling the character interactions

    With the igraph library and the adjacency matrix described in section III, we were able to quantify the interactions among the 869 characters.

    Interactivity Networks
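
    The graph behind this interactive network can be sketched as follows, turning the adjacency matrix from section III into a weighted graph:

    library(igraph)

    g <- graph_from_adjacency_matrix(adjacency, mode = "undirected",
                                     weighted = TRUE)

    # characters with the most interactions overall (weighted degree)
    head(sort(strength(g), decreasing = TRUE), 10)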

    iii) Extracting the main topics: topic modelling using LDA

    For topic modelling we will use the package textmineR. We will try to find the topic for each episode. To do so we will create a document for each episode, so we have to group lines by episode_id.
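
    A sketch of the grouping step:

    library(dplyr)

    # one document (all lines concatenated) per episode
    docs <- dialogues %>%
      group_by(episode_id) %>%
      summarise(lines = paste(line, collapse = " "))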

    ## # A tibble: 6 x 2
    ##   episode_id lines                                                         
    ##   <chr>      <chr>                                                         
    ## 1 1 : 01     There's nothing to tell! He's just some guy I work with! C'mo…
    ## 2 1 : 02     What you guys don't understand is, for us, kissing is as impo…
    ## 3 1 : 03     Hi guys! Hey, Pheebs! Hi! Hey. Oh, oh, how'd it go? Um, not s…
    ## 4 1 : 04     "Alright. Phoebe? Okay, okay. If I were omnipotent for a day,…
    ## 5 1 : 05     "Would you let it go? It's not that big a deal. Not that big …
    ## 6 1 : 06     Ooh! Look! Look! Look! Look, there's Joey's picture! This is …

    The function CreateDtm creates a document term matrix. To do so we use a set of stopwords: words we do not want to use because they occur frequently in the English language and do not provide insightful information.

    We will use the document term matrix to create a Term Document Frequency matrix that counts the number of times a term appears (term frequency) and the number of documents in which a term appears (document frequency). A sketch of both steps follows.
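
    A sketch of both steps with textmineR (the stopword set shown is CreateDtm's default and is an assumption here):

    library(textmineR)

    dtm <- CreateDtm(doc_vec = docs$lines,
                     doc_names = docs$episode_id,
                     stopword_vec = c(stopwords::stopwords("en"),
                                      stopwords::stopwords(source = "smart")))

    tf <- TermDocFreq(dtm = dtm)
    head(tf[order(-tf$term_freq), ])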

    These are the main terms ordered by term frequency:

    ##        term term_freq doc_freq
    ## 11509  good      1714      231
    ## 11508   god      1677      228
    ## 11507  guys      1468      225
    ## 11506 great      1342      225
    ## 11505  time      1215      229
    ## 11504  back      1125      223

    Now we fit a Latent Dirichlet Allocation model in which we try to fit 15 topics to the collection of episodes (a sketch of the call follows the list below). The model returns two main matrices:

    • theta: Matrix with the probability of topic per document -> P(topic | document).
    • phi: Matrix with the probability of term per topic -> P(term | topic).
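
    A sketch of the fit with textmineR (the iteration counts and seed are assumptions):

    set.seed(123)
    lda <- FitLdaModel(dtm = dtm, k = 15,
                       iterations = 500, burnin = 180,
                       calc_coherence = TRUE)

    head(lda$theta)          # P(topic | document)
    head(lda$phi)            # P(term | topic)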
    ## [1] "Theta:"
    ##                t_1          t_2          t_3         t_4        t_5
    ## 1 : 01 0.003959440 0.2482858522 0.0348623853 0.069628199 0.08025109
    ## 1 : 02 0.017833456 0.0281503316 0.0325718497 0.237435520 0.03109801
    ## 1 : 03 0.011225296 0.0001581028 0.1693280632 0.003320158 0.02387352
    ## 1 : 04 0.004108681 0.0014579192 0.0557985421 0.139297548 0.09158383
    ## 1 : 05 0.007375271 0.0001446132 0.0001446132 0.111496746 0.11005061
    ## 1 : 06 0.028275352 0.0031088083 0.0549222798 0.001628423 0.39393042
    ## [1] "Phi:"
    ##          met_guy       gellar meeting_meeting cameras_smell    potpourri
    ## t_1 8.583028e-06 8.583028e-06    8.583028e-06  8.583028e-06 8.583028e-06
    ## t_2 8.660333e-06 8.660333e-06    8.660333e-06  1.818670e-04 8.660333e-06
    ## t_3 5.786066e-06 5.786066e-06    5.786066e-06  5.786066e-06 5.786066e-06
    ## t_4 9.490457e-06 9.490457e-06    9.490457e-06  9.490457e-06 9.490457e-06
    ## t_5 6.904695e-06 6.904695e-06    6.904695e-06  6.904695e-06 6.904695e-06
    ## t_6 9.590578e-06 9.590578e-06    9.590578e-06  9.590578e-06 9.590578e-06

    Now the 15 topics have been created. To assess topic quality we look at the topic coherence, a measure of how strongly the words in a topic are associated.

    ##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    ## -0.002827  0.003716  0.019262  0.035778  0.056138  0.115704

    We will use phi to get the top 5 terms per topic.
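
    textmineR provides a helper for this step; GetTopTerms returns one column of terms per topic, which we transpose to match the table shown below:

    top_terms <- GetTopTerms(phi = lda$phi, M = 5)
    t(top_terms)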

    ##      [,1]      [,2]           [,3]       [,4]        [,5]     
    ## t_1  "sister"  "meet"         "give"     "part"      "honey"  
    ## t_2  "wedding" "married"      "guys"     "love"      "parents"
    ## t_3  "guys"    "game"         "money"    "apartment" "move"   
    ## t_4  "janice"  "carol"        "guy"      "woman"     "susan"  
    ## t_5  "guys"    "big"          "bye"      "julie"     "listen" 
    ## t_6  "emma"    "mike"         "guys"     "baby"      "love"   
    ## t_7  "monkey"  "marcel"       "people"   "joke"      "drake"  
    ## t_8  "dad"     "birthday"     "mom"      "guys"      "party"  
    ## t_9  "baby"    "ring"         "pregnant" "guys"      "god"    
    ## t_10 "job"     "guy"          "guys"     "great"     "good"   
    ## t_11 "cat"     "thing"        "mark"     "god"       "love"   
    ## t_12 "guys"    "love"         "good"     "plane"     "bob"    
    ## t_13 "guys"    "thanksgiving" "year"     "dog"       "school" 
    ## t_14 "good"    "god"          "time"     "great"     "wait"   
    ## t_15 "emily"   "married"      "love"     "london"    "pheebs"

    The next step is to compute the topic prevalence using theta. Topic prevalence indicates the most frequent topics in the TV show.

    Finally, we get a summary for the complete LDA model.
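
    A sketch of both computations, expressing prevalence as each topic's share of the total topic probability, in percent (a common textmineR convention):

    prevalence <- colSums(lda$theta) / sum(lda$theta) * 100

    lda_summary <- data.frame(
      topic      = rownames(lda$phi),
      coherence  = round(lda$coherence, 3),
      prevalence = round(prevalence, 3),
      top_terms  = apply(top_terms, 2, paste, collapse = ", ")
    )
    lda_summary[order(-lda_summary$prevalence), ]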

    ##      topic coherence prevalence                             top_terms
    ## t_14  t_14     0.000     40.628          good, god, time, great, wait
    ## t_10  t_10     0.003      6.197           job, guy, guys, great, good
    ## t_3    t_3     0.009      5.902    guys, game, money, apartment, move
    ## t_5    t_5    -0.003      5.432         guys, big, bye, julie, listen
    ## t_8    t_8     0.064      4.690       dad, birthday, mom, guys, party
    ## t_9    t_9     0.019      4.507       baby, ring, pregnant, guys, god
    ## t_1    t_1     0.004      4.174       sister, meet, give, part, honey
    ## t_4    t_4     0.061      4.097      janice, carol, guy, woman, susan
    ## t_2    t_2     0.050      3.800 wedding, married, guys, love, parents
    ## t_15  t_15     0.116      3.640  emily, married, love, london, pheebs
    ## t_7    t_7     0.052      3.601   monkey, marcel, people, joke, drake
    ## t_11  t_11     0.006      3.494           cat, thing, mark, god, love
    ## t_13  t_13     0.051      3.460 guys, thanksgiving, year, dog, school
    ## t_6    t_6     0.108      3.349          emma, mike, guys, baby, love
    ## t_12  t_12    -0.002      3.028          guys, love, good, plane, bob

    We can see that the most prevalent (frequent) topic has words like “good”, “god”, “great”, “time”. This makes sense: these words are very frequent in the TV show, which is why they give very little information about the topic, and why its coherence is 0.0.

    The other topics in the model have less prevalence, but they are more coherent. If you are a fan of the show and you read the lists of top terms, we are sure you can remember episodes in which those terms were important.

    To find those important episodes we created a D3 tool. We wrote a CSV file using theta in which, for each episode and topic, we store the probability of that topic given the episode, together with the top terms of that topic.
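
    A sketch of building that file, reshaping theta into long format (the episode-name join with the IMDb titles is omitted, and the file name is illustrative):

    library(dplyr)
    library(tidyr)
    library(readr)

    top_terms_str <- apply(top_terms, 2, paste, collapse = ", ")

    topics_long <- as.data.frame(lda$theta) %>%
      tibble::rownames_to_column("id") %>%
      pivot_longer(-id, names_to = "topic", values_to = "value") %>%
      mutate(topic_num = as.integer(sub("t_", "", topic)),
             top_terms = top_terms_str[topic])

    write_csv(topics_long, "episode_topics.csv")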

    ##       id topic        value topic_num
    ## 1 1 : 01   t_1 0.0039594399         1
    ## 2 1 : 01  t_14 0.3690004829        14
    ## 3 1 : 01   t_6 0.0000965717         6
    ## 4 1 : 01  t_15 0.0020280058        15
    ## 5 1 : 01  t_13 0.0242394978        13
    ## 6 1 : 01   t_7 0.0551424433         7
    ##                               top_terms                   name
    ## 1       sister, meet, give, part, honey Monica Gets A Roommate
    ## 2          good, god, time, great, wait Monica Gets A Roommate
    ## 3          emma, mike, guys, baby, love Monica Gets A Roommate
    ## 4  emily, married, love, london, pheebs Monica Gets A Roommate
    ## 5 guys, thanksgiving, year, dog, school Monica Gets A Roommate
    ## 6   monkey, marcel, people, joke, drake Monica Gets A Roommate
    Friends episodes topics

    Top episodes by topic

    Click the circles to see the episodes with highest P(topic | episode).

    iv) Rating contribution per character


    VI. Interactive component

    We included four interactive elements: the participation scatter plot, the character interaction network, the Friends episodes topics view, and the top episodes by topic view.

    VII. Conclusion